install.packages("nycflights13")
Error in install.packages : Updating loaded packages
Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load the tidyverse package.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2 v purrr 0.3.4
v tibble 3.0.2 v dplyr 1.0.0
v tidyr 1.1.0 v stringr 1.4.0
v readr 1.3.1 v forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Enter your code chunks for Section 10.2 here.
Describe what each chunk code does.
Create a tibble from a data frame
as_tibble(iris)
Create a tibble from individual vectors; y is recycled and z is computed from x and y
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
Use backticks to create non-syntactic column names
tb <- tibble(`:)` = "smile", ` ` = "space", `2000` = "number")
tb
Create a tibble row by row with tribble()
tribble(~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5)
Enter your code chunks for Section 10.3 here.
Describe what each chunk code does.
Printing a tibble shows only the first 10 rows and the columns that fit on screen
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE))
Print explicitly with print() to control the number of rows (n) and the width
nycflights13::flights %>%
print(n = 10, width = Inf)
Open the flights dataset in the RStudio viewer
nycflights13::flights %>%
View()
Create a tibble
df <- tibble(x = runif(5), y = rnorm(5))
Extract a variable by name with $
df$x
[1] 0.28643206 0.03715773 0.81922921 0.97439266 0.77364389
Extract a variable by name with [[
df[["x"]]
[1] 0.28643206 0.03715773 0.81922921 0.97439266 0.77364389
Extract a variable by position
df[[1]]
[1] 0.28643206 0.03715773 0.81922921 0.97439266 0.77364389
Extract a variable by name with use of pipe
df %>% .$x
[1] 0.28643206 0.03715773 0.81922921 0.97439266 0.77364389
Extract a variable by name with [[ and a pipe
df %>% .[["x"]]
[1] 0.28643206 0.03715773 0.81922921 0.97439266 0.77364389
Answer the questions completely. Use code chunks, text, or both, as necessary.
1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble.
If an object is a tibble, printing it shows only the first 10 rows, along with the dimensions and the type of each column; a regular data frame such as mtcars prints every row. You may also use the is_tibble() function to determine whether an object is a tibble.
Hint: What does as_tibble() do? It turns an existing data frame into a tibble. What does class() do? It identifies the class of an object; a tibble has the classes "tbl_df", "tbl", and "data.frame". What does str() do? It reports the basic structure of an object.
mtcars
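The checks described above can be sketched as follows. This is a minimal illustration that assumes only the tibble package; the object names `df_plain` and `tb` are made up for the example.

```r
library(tibble)

df_plain <- mtcars          # a regular data frame
tb <- as_tibble(mtcars)     # the same data converted to a tibble

is_tibble(df_plain)         # FALSE
is_tibble(tb)               # TRUE

class(df_plain)             # "data.frame"
class(tb)                   # "tbl_df" "tbl" "data.frame"
```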
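2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
With a data frame, $ does partial matching, so df$x silently returns the xyz column; a tibble never does partial matching and returns NULL with a warning. With a data frame, df[, "xyz"] returns a vector while df[, c("abc", "xyz")] returns a data frame, so the type of the result depends on how many columns are selected; with a tibble, [ always returns a tibble. The data frame defaults can cause frustration because a mistyped column name can silently return the wrong column, and code that uses [ has to handle two possible result types.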
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] "a"
df[, "xyz"]
[1] "a"
df[, c("abc", "xyz")]
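For comparison, here is a small sketch that runs the same operations on a data frame and on an equivalent tibble. The name `tb` is introduced only for this illustration.

```r
library(tibble)

df <- data.frame(abc = 1, xyz = "a")
tb <- tibble(abc = 1, xyz = "a")

df$x                           # partial matching returns the xyz column
tb$x                           # NULL, with a warning about an unknown column

class(df[, "xyz"])             # a bare vector, not a data frame
class(tb[, "xyz"])             # still a tibble
class(df[, c("abc", "xyz")])   # a data frame once two columns are selected
```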
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
Read an inline CSV string
read_csv("a,b,c
1,2,3
4,5,6")
Read an inline CSV string, skipping the first two lines of metadata
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
Read an inline CSV string, skipping lines that start with a comment character
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
Read an inline CSV string that has no column names; the columns are labelled X1, X2, X3
read_csv("1,2,3\n4,5,6", col_names = FALSE)
Read an inline CSV string and supply the column names as a character vector
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
Read an inline CSV string and treat "." as a missing value (NA)
read_csv("a,b,c\n1,2,.", na = ".")
1: What function would you use to read a file where fields were separated with “|”? read_delim(), with delim = "|"
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file.csv", delim = ",")
file <- read_delim("file.csv", delim = "\t")
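The delimiter choices from questions 1 and 2 can be tried without any files on disk. The sketch below assumes only the readr package and uses short literal strings in place of files; the object names are illustrative.

```r
library(readr)

# A string that contains a newline is treated by readr as inline data
# rather than as a file path.
pipe_data  <- read_delim("a|b|c\n1|2|3", delim = "|")
comma_data <- read_delim("a,b,c\n1,2,3", delim = ",")
tab_data   <- read_delim("a\tb\tc\n1\t2\t3", delim = "\t")
```

Each call should return a one-row tibble with columns a, b, and c; only the delim argument changes.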
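3: What are the two most important arguments to read_fwf()? Why? file and col_positions. Fixed-width files have no delimiter, so col_positions tells read_fwf() where each column begins and ends; it can be built from column widths with fwf_widths() or from start and end positions with fwf_positions().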
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
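read_csv("a,b\n1,2,3\n4,5,6") - Only two column names are provided but each row has three values, so readr reports a parsing failure for each row and, in the output above, the third value is dropped.
read_csv("a,b,c\n1,2\n1,2,3,4") - Three column names are provided, but the first row has only two values (c is filled with NA) and the second row has four (the extra value is dropped).
read_csv("a,b\n\"1") - The opening quotation mark is never closed, and the row has only one value, so b is NA.
read_csv("a,b\n1,2\na,b") - Nothing fails to parse, but the second data row contains the text "a" and "b", so both columns are read as character rather than numeric.
read_csv("a;b\n1;3") - read_csv() splits on commas and does not recognize semicolons, so the data is read as a single column named "a;b". Use read_csv2() or read_delim() with delim = ";" instead.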
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
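As a rough illustration of writing a file, the sketch below saves a small made-up tibble (`scores`, hypothetical data) to a temporary path and reads it back. It assumes the readr and tibble packages.

```r
library(readr)
library(tibble)

scores <- tibble(name = c("a", "b"), score = c(1.5, 2))  # hypothetical data
out_path <- tempfile(fileext = ".csv")                   # temporary location

write_csv(scores, out_path)         # data object first, then the path
scores_back <- read_csv(out_path)   # read it back to check the round trip
```

In real use, out_path would be a path of your choosing, such as a file inside your project folder.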
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
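To see what the pipe buys you, here is a small sketch that writes the same two-step operation as nested calls and as a pipeline. It uses the built-in mtcars data and assumes the dplyr package; the object names are illustrative.

```r
library(dplyr)

# Nested calls read from the inside out...
nested <- arrange(filter(mtcars, cyl == 4), desc(mpg))

# ...while the piped version reads from top to bottom.
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg))

identical(nested, piped)
```

Both forms run the same functions in the same order; the pipe only changes how the code reads.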
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above it. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
Print the table4a dataset
table4a
Pivot table4a longer: the 1999 and 2000 columns become a year column, with their values in cases
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
Pivot table4b longer: the 1999 and 2000 columns become a year column, with their values in population
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
Tidy table 4a
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
Tidy table 4b
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
Left join the tidied table4a to the tidied table4b
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
Print table2
table2
Use pivot_wider() to spread the type column into separate cases and population columns
table2 %>%
pivot_wider(names_from = type, values_from = count)
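pivot_longer() and pivot_wider() undo each other, which a small round trip can show. The tibble `wide` below is made-up data shaped like table4a, and the sketch assumes the tidyr, tibble, and dplyr packages.

```r
library(tidyr)
library(tibble)
library(dplyr)

# Hypothetical data with one column per year, as in table4a
wide <- tibble(country = c("A", "B"), `1999` = c(10, 20), `2000` = c(30, 40))

# Longer: the year columns become rows
long <- wide %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

# Wider: the rows become year columns again
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = cases)
```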
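2: Why does this code fail? Fix it so it works. pivot_longer() was omitted, and so were the backticks around 1999 and 2000. Those column names are non-syntactic (they start with a digit), so they must be wrapped in backticks; bare 1999 and 2000 are read as column positions.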
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
That is all for Chapter 12. On to the last chapter.
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
library(tidyverse)
library(nycflights13)
Print the flights tibble
flights
filter()
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
Filter flights by month and day
filter(flights, month == 1, day == 1)
Save the results of 01/01
jan1 <- filter(flights, month == 1, day == 1)
Save and print the results of flights on 12/25
(dec25 <- filter(flights, month == 12, day == 25))
Error from using = instead of ==
filter(flights, month = 1)
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
i This usually means that you've used `=` instead of `==`.
i Did you mean `month == 1`?
Floating-point arithmetic makes an exact comparison fail
sqrt(2) ^ 2 == 2
[1] FALSE
Use of near()
near(sqrt(2) ^ 2, 2)
[1] TRUE
Use of near()
near(1 / 49 * 49, 1)
[1] TRUE
All flights that departed in November or December
filter(flights, month == 11 | month == 12)
Shorthand to find all November and December flights
nov_dec <- filter(flights, month %in% c(11, 12))
Flights that weren’t delayed (on arrival or departure) by more than two hours
filter(flights, !(arr_delay > 120 | dep_delay > 120))
The same filter written without negation
filter(flights, arr_delay <= 120, dep_delay <= 120)
Create a tibble with a missing value
df <- tibble(x = c(1, NA, 3))
filter() drops rows where the condition is FALSE or NA
filter(df, x > 1)
Ask for missing values explicitly to keep them
filter(df, is.na(x) | x > 1)
1.1: Find all flights with a delay of 2 hours or more.
filter(flights, dep_delay >= 120)
1.2: Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
1.3: Were operated by United (UA), American (AA), or Delta (DL).
filter(flights, carrier == "UA" | carrier == "AA" | carrier == "DL")
1.4: Departed in summer (July, August, and September).
filter(flights, month == 7 | month == 8 | month == 9)
1.5: Arrived more than two hours late, but didn’t leave late.
filter(flights, dep_delay <= 0 & arr_delay > 120)
1.6: Were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best.
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
1.7: Departed between midnight and 6am (inclusive). Midnight is recorded as 2400 in dep_time, not 0.
filter(flights, dep_time <= 600 | dep_time == 2400)
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges? between(x, left, right) is a shortcut for x >= left & x <= right.
The range test in 1.7 can be shortened with between(); flights that departed exactly at midnight are recorded as 2400, so they need their own condition.
filter(flights, between(dep_time, 0, 600) | dep_time == 2400)
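A quick way to convince yourself that between() is only shorthand is to compare it with the two-sided test on a small vector. The values in `dep_time` below are hypothetical HHMM departure times, and the sketch assumes the dplyr package.

```r
library(dplyr)

dep_time <- c(1, 559, 600, 601, 2400, NA)   # hypothetical HHMM times

short_form <- between(dep_time, 0, 600)
long_form  <- dep_time >= 0 & dep_time <= 600

identical(short_form, long_form)
```

Both forms include the endpoints and both return NA for a missing departure time.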
3: How many flights have a missing dep_time? 8255. What other variables are missing? Departure delay, arrival time, arrival delay, and air time. What might these rows represent? Most likely canceled flights that never departed.
sum(is.na(flights$dep_time))
[1] 8255
filter(flights, is.na(dep_time))
4: Why is NA ^ 0 not missing? Any number raised to the power of zero is 1, so the result is 1 no matter what the missing value is. Why is NA | TRUE not missing? Anything "or TRUE" is always TRUE. Why is FALSE & NA not missing? Anything "and FALSE" is always FALSE.
Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
Note: For some context, see this thread
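The expressions in question 4 can be run directly in base R. One way to read the results, offered as an interpretation and not as the textbook's wording, is that an operation on NA returns a value only when the answer would be the same for every possible value the NA could stand for.

```r
NA ^ 0       # 1: x ^ 0 is 1 for every number x
NA | TRUE    # TRUE: the result is TRUE whatever the missing value is
FALSE & NA   # FALSE: the result is FALSE whatever the missing value is
NA * 0       # NA: the result depends on the missing value
Inf * 0      # NaN: the reason NA * 0 cannot simply be 0
```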
arrange()
Arrange flights by year, month, and day
arrange(flights, year, month, day)
Sort rows by dep_delay in descending order
arrange(flights, desc(dep_delay))
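Create a tibble with a missing value
df <- tibble(x = c(5, 2, NA))
Missing values are sorted to the end in ascending order
arrange(df, x)
Missing values are also sorted to the end in descending order
arrange(df, desc(x))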
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do?
arrange(flights, desc(is.na(dep_delay)))
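The same idea on a tiny made-up tibble makes the ordering easy to check by eye. The sketch assumes the dplyr package; `df_na` and `na_first` are illustrative names, and the second sort key is optional.

```r
library(dplyr)

df_na <- tibble(x = c(5, 2, NA, 8))   # hypothetical data

# is.na(x) is TRUE for missing rows; desc() puts TRUE before FALSE,
# and the second key sorts the remaining rows in ascending order.
na_first <- arrange(df_na, desc(is.na(x)), x)
na_first
```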
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
Most delayed flights
arrange(flights, desc(dep_delay))
Flights that left the earliest
arrange(flights, dep_delay)
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air.
arrange(flights, air_time)
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
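One possible approach to the optional challenge is to sort on a computed speed: distance divided by air time in minutes, multiplied by 60 to get miles per hour. The sketch below uses a made-up tibble (`trips`) with the same column names as flights and assumes the dplyr package.

```r
library(dplyr)

# Hypothetical flights: distance in miles, air_time in minutes
trips <- tibble(
  flight   = c("F1", "F2", "F3"),
  distance = c(500, 1000, 200),
  air_time = c(60, 150, 20)
)

# Speeds are 500, 400, and 600 mph, so F3 should come first
fastest <- arrange(trips, desc(distance / air_time * 60))
fastest
```

The same expression should work on the real data as arrange(flights, desc(distance / air_time * 60)).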
4: Which flights travelled the longest? Which travelled the shortest?
Flights that traveled the longest distance
arrange(flights, desc(distance))
Flights that traveled the shortest distance
arrange(flights, distance)
select()
Select columns by name
select(flights, year, month, day)
Select all columns between year and day (inclusive)
select(flights, year:day)
Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
Rename variables
rename(flights, tail_num = tailnum)
Move variables to start of dataframe
select(flights, time_hour, air_time, everything())
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
1- Name the columns directly in select()
select(flights, dep_time, dep_delay, arr_time, arr_delay)
2- Use starts_with()
select(flights, starts_with('dep'), starts_with('arr'))
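3- Use matches() with a regular expression. (contains('delay') with contains('time') would also pick up sched_dep_time, sched_arr_time, air_time, and time_hour.)
select(flights, matches('^(dep|arr)_(time|delay)$'))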
2: What happens if you include the name of a variable multiple times in a select() call?
The duplicates are ignored: each variable appears once, at the position of its first mention, and no error or warning is raised.
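This is easy to check on a small example. The sketch below uses the built-in mtcars data and assumes the dplyr package.

```r
library(dplyr)

# mpg is named twice but appears only once in the result
dup <- select(mtcars, mpg, cyl, mpg)
names(dup)
```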
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
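one_of() selects the variables whose names are stored in a character vector, so select(flights, one_of(vars)) returns the five columns named in vars. This is helpful when the column names are built programmatically or reused in several places.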
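Selecting from a character vector can be sketched on the built-in mtcars data. The example assumes a dplyr installation recent enough to provide any_of(), a newer helper that serves the same purpose as one_of() and silently skips names that are not present; the vector names here are illustrative.

```r
library(dplyr)

wanted <- c("mpg", "cyl")
by_vector <- select(mtcars, one_of(wanted))

# The last name is deliberately absent from mtcars
maybe <- c("mpg", "cyl", "not_a_column")
by_any <- select(mtcars, any_of(maybe))

names(by_vector)
names(by_any)
```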
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
Yes, this is surprising because R is case sensitive, but contains() ignores case by default. To change the default, set ignore.case = FALSE: